{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### INSTRUCTIONS AND RECOMMENDATIONS FOR CREATING YOUR OWN DATABASE FOR PHYLOGENETIC STUDIES"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A database for phylogenetic study must contain:\n",
    "- a BLAST database with the chosen proteomes;\n",
    "- a file linking a gene to its protein products (g2r.tsv);\n",
    "- a file linking a taxon identifier to scientific name of an organism (names.dmp)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to get the most complete proteomes from the RefSeq database, so if you know you will be using your own set of proteomes, please, modify the algorithm accordingly. Moreover, we will need BLAST+ as well as Entrez Direct utilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the algorithm is not intended for studying prokaryotic genes since these proteomes often use non-redundant protein accession numbers (with the prefix 'WP_')."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Getting the necessary files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, download the current proteomes list from RefSeq with the corresponding BUSCO completeness scores (you will need Entrez Direct installed)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "mkdir refseq_proteomes\n",
    "cd refseq_proteomes\n",
    "esearch -db assembly -query 'has_egap_annotation[prop] AND \"latest refseq\"[filter]' \\\n",
    "  | esummary \\\n",
    "  | xtract \\\n",
    "    -pattern DocumentSummary \\\n",
    "    -element \\\n",
    "      AssemblyAccession,\\\n",
    "      Organism,\\\n",
    "      Taxid,\\\n",
    "      Busco/BuscoLineage,\\\n",
    "      Busco/TotalCount,\\\n",
    "      Busco/Complete,\\\n",
    "      Busco/SingleCopy,\\\n",
    "      Busco/Duplicated,\\\n",
    "      Busco/Fragmented,\\\n",
    "      Busco/Missing \\\n",
    "> busco_scores.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, get the taxdump.tar.gz."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd refseq_proteomes\n",
    "mkdir taxonomy\n",
    "cd taxonomy\n",
    "wget -q ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz -O taxdump.tar.gz \n",
    "sleep 1\n",
    "tar -xzf taxdump.tar.gz\n",
    "rm taxdump.tar.gz\n",
    "rm citations.dmp\n",
    "rm delnodes.dmp\n",
    "rm division.dmp\n",
    "rm gencode.dmp\n",
    "rm images.dmp\n",
    "rm merged.dmp\n",
    "rm readme.txt\n",
    "rm gc.prt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Find the most complete proteomes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We don't need to take all proteomes into the analysis: it is sufficient to take in several most complete proteomes from each class or another taxonomic rank."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will take 3 most complete proteomes from classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets load nodes.dmp and names.dmp for managing taxonomic ranks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "sep = '\\t\\|\\t'\n",
    "\n",
    "nodes = pd.read_table(\n",
    "    'refseq_proteomes/taxonomy/nodes.dmp',\n",
    "    sep = sep,\n",
    "    header = None,\n",
    "    engine ='python'\n",
    ")\n",
    "\n",
    "nodes.columns = [\n",
    "\t'Taxid',\n",
    " \t'Parent',\n",
    " \t'Rank',\n",
    " \t'EMBL',\n",
    " \t'Division',\n",
    " \t'Inherited_div',\n",
    " \t'Gencode',\n",
    " \t'Inherited_gencode',\n",
    " \t'Mito',\n",
    " \t'Inherited_mito',\n",
    " \t'GenBank_hidden',\n",
    " \t'Hidden_subtree',\n",
    " \t'Comments'\n",
    "]\n",
    "\n",
    "names = pd.read_table(\n",
    "    'refseq_proteomes/taxonomy/names.dmp',\n",
    "    sep = sep,\n",
    "    header = None,\n",
    "    engine ='python'\n",
    ")\n",
    "\n",
    "names.columns = [\n",
    "\t'Taxid',\n",
    "\t'Name',\n",
    "\t'Unique',\n",
    "\t'Class'\n",
    "]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Saving all identifiers of classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classes_ids = nodes[nodes['Rank'] == 'class']['Taxid'].to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also have to use a sorted (by a complete BUSCO score -- column 5 -- in our case) file with BUSCO scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "busco = pd.read_table(\n",
    "    'refseq_proteomes/busco_scores.tsv',\n",
    "    header = None,\n",
    "    sep = '\\t'\n",
    ")\n",
    "\n",
    "busco.columns = [\n",
    "    'GCF',\n",
    "    'Name',\n",
    "    'Taxid',\n",
    "    'Lineage',\n",
    "    'Count',\n",
    "    'Score',\n",
    "    'Single',\n",
    "    'Dupl',\n",
    "    'Fragm',\n",
    "    'Miss'\n",
    "]\n",
    "busco = busco.dropna()\n",
    "busco = busco.sort_values(by = ['Score'], ascending = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets use the taxid column as a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_class(current_taxid):\n",
    "    if not (current_taxid in classes_ids):\n",
    "        current_taxid = nodes[nodes['Taxid'] == current_taxid]['Parent'].to_list()\n",
    "        if len(set(current_taxid)) > 1:\n",
    "            raise Exception('Taxid', current_taxid, 'has multiple parents')\n",
    "        elif len(set(current_taxid)) == 0:\n",
    "            raise Exception('Warning! Root-tracing failed')\n",
    "        elif current_taxid[0] == 1:\n",
    "            return 0\n",
    "        else:\n",
    "            return find_class(current_taxid[0])\n",
    "    else:\n",
    "        return current_taxid\n",
    "\n",
    "busco['Class'] = busco['Taxid'].map(find_class).astype(int)\n",
    "print(len(busco['Class'].unique()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There were 36 classes in total, and we are going to obtain up to 3*36=111 proteomes. Note that 0s indicate organisms that do not have class according to the nodes.dmp file. Lets see what these are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "busco[busco['Class'] == 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the original paper, we have used other taxonomic ranks for these species. Now we are going to just take the top 3 proteomes with 0s, as well as Latimeria chalumnae and Protopterus annectens and human."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proteomes_list = [\n",
    "    busco[busco['Taxid'] == 7897]['GCF'].to_list()[0],\n",
    "    busco[busco['Taxid'] == 7888]['GCF'].to_list()[0],\n",
    "    busco[busco['Taxid'] == 9606]['GCF'].to_list()[0]\n",
    "]\n",
    "# The top 3 with 0s will be added later"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets just take the top 3 proteomes in each class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for name, class_df in busco.groupby('Class'):\n",
    "    class_df = class_df.sort_values(by = ['Score'], ascending = False)\n",
    "    class_proteomes = list()\n",
    "    taxids = set()\n",
    "    for index, row in class_df.iterrows():\n",
    "        if not (row['Taxid'] in taxids):\n",
    "            class_proteomes.append(row['GCF'])\n",
    "        taxids.add(row['Taxid'])\n",
    "    proteomes_list += class_proteomes[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets write down obtained assembly identifiers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./refseq_proteomes/assembly_ids.txt', 'w') as out:\n",
    "    out.write('\\n'.join(list(set(proteomes_list))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Download proteomes, create and configure your BLAST database."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NCBI Datasets tools might be used to download the proteomes but it is more convenient to use batch Entrez option.\n",
    "- Go to https://www.ncbi.nlm.nih.gov/sites/batchentrez;\n",
    "- Choose the Assembly Database;\n",
    "- Select the \"assembly_ids.txt\" file;\n",
    "- Click \"Retrieve\";\n",
    "- Click \"Retrieve records for ... UID(s)\";\n",
    "- Click \"Download Assemblies\", then choose the RefSeq database and \"Protein FASTA (.faa)\" file type;\n",
    "- Create the \"./refseq_proteomes/assembly_files\" folder and place TAR files there;\n",
    "- Click \"Download Assemblies\", then choose the RefSeq database and \"Feature table (.txt)\" file type;\n",
    "- Place TAR files in the newly created \"./refseq_proteomes/assembly_files\" folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets manage the obtained files and concatenate all FASTA files and feature tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd ./refseq_proteomes/assembly_files\n",
    "tar -xf genome_assemblies_features.tar\n",
    "tar -xf genome_assemblies_prot_fasta.tar\n",
    "mkdir ../busco_refseq\n",
    "\n",
    "# Making a single FASTA file\n",
    "> ../busco_refseq/busco_refseq.fasta\n",
    "for fasta_file in ./*/*.faa.gz\n",
    "do\n",
    "  zcat $fasta_file >> ../busco_refseq/busco_refseq.fasta\n",
    "done\n",
    "\n",
    "# Header for feature table\n",
    "for ft_file in ./*/*.txt.gz\n",
    "do\n",
    "  zcat $ft_file | head -n 1 | awk '{ \\\n",
    "      print $3,$11,$15,$16; \\\n",
    "    }' FS='\\t' OFS='\\t' \\\n",
    "    > ../busco_refseq/busco_feature_table.txt\n",
    "  break\n",
    "done\n",
    "\n",
    "# Feature table\n",
    "for ft_file in ./*/*.txt.gz\n",
    "do\n",
    "  zcat $ft_file | awk '{ \\\n",
    "      if ($1 == \"CDS\" && $2 == \"with_protein\") \\\n",
    "        print $3,$11,$15,$16; \\\n",
    "    }' FS='\\t' OFS='\\t' \\\n",
    "    >> ../busco_refseq/busco_feature_table.txt\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking if FASTA files and the feature table are fully compatible..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import gzip\n",
    "from glob import glob\n",
    "\n",
    "ft = pd.read_csv('refseq_proteomes/busco_refseq/busco_feature_table.txt', sep = '\\t')\n",
    "\n",
    "assemblies = set()\n",
    "for path in glob('refseq_proteomes/assembly_files/*/*.faa.gz'):\n",
    "    filename = os.path.basename(path)\n",
    "    assembly = '_'.join(filename.split('_')[0:2])\n",
    "    assemblies.add(assembly)\n",
    "\n",
    "if assemblies == set(ft['assembly']) == set(proteomes_list):\n",
    "    print(\"Assemblies list: ok\")\n",
    "else:\n",
    "    raise Exception(\"FASTA file and assembly list do not match\")\n",
    "\n",
    "for path in glob('refseq_proteomes/assembly_files/*/*.faa.gz'):\n",
    "    filename = os.path.basename(path)\n",
    "    with gzip.open(path, 'r') as inp:\n",
    "        proteome = inp.readlines()\n",
    "    proteome = [p.decode('utf-8').split()[0][1:] for p in proteome if p.startswith(b'>')]\n",
    "    assembly = '_'.join(filename.split('_')[0:2])\n",
    "    curr_df = ft[ft['assembly'] == assembly]\n",
    "    if set(proteome) == set(curr_df['product_accession']):\n",
    "        print(assembly + ': ok')\n",
    "    else:\n",
    "        raise Exception(assembly + ': protein accession numbers from the feature table and FASTA files do not match')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adding the taxonomy information to the feature table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "busco = busco[['GCF', 'Taxid']]\n",
    "\n",
    "taxid_dict = dict()\n",
    "for index, b in busco.iterrows():\n",
    "    taxid_dict[b['GCF']] = b['Taxid']\n",
    "\n",
    "ft['#tax_id'] = ft['assembly'].map(taxid_dict)\n",
    "\n",
    "if len(ft['#tax_id'].unique()) != len(ft['assembly'].unique()):\n",
    "    raise Exception('Check the tax_id and assembly correspondence')\n",
    "\n",
    "g2r = ft\n",
    "\n",
    "g2r = g2r.rename(columns = {\n",
    "    'product_accession': 'protein_accession.version',\n",
    "    'symbol': 'Symbol'\n",
    "})\n",
    "\n",
    "g2r.to_csv('refseq_proteomes/g2r.tsv', sep = '\\t', index = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, lets create a BLAST search database (provide path to your makeblastdb binary if necessary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd ./refseq_proteomes/busco_refseq\n",
    "makeblastdb \\\n",
    "  -dbtype prot \\\n",
    "  -in ./busco_refseq.fasta \\\n",
    "  -title busco_refseq \\\n",
    "  -parse_seqids \\\n",
    "  -out busco_refseq \\\n",
    "  -max_file_sz 2GB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Edit the .ncbirc file accordingly (insert your paths if necessary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "busco_refseq_path=$(readlink -f ./refseq_proteomes/busco_refseq)\n",
    "echo \"[BLAST]\" > ~/.ncbirc\n",
    "echo \"BLASTDB=${busco_refseq_path}\" >> ~/.ncbirc\n",
    "echo \"DATA_LOADERS=blastdb\" >> ~/.ncbirc\n",
    "echo \"BLASTDB_PROT_DATA_LOADER=busco_refseq\" >> ~/.ncbirc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Edit the configuration file (replace blastdbcmd_path and blastp_path if necessary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "refseq_proteomes_path=$(readlink -f ./refseq_proteomes)\n",
    "blastdbcmd_path=$(which blastdbcmd)\n",
    "blastp_path=$(which blastp)\n",
    "echo \"# Local 'gene2refseq' file\" > ./cogconf.txt\n",
    "echo \"path2G2R:${refseq_proteomes_path}/g2r.tsv\" >> ./cogconf.txt\n",
    "echo \"# If you use refseq, download this file and unzip, then provide names.dmp file: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz\" >> ./cogconf.txt\n",
    "echo \"path2T2N:${refseq_proteomes_path}/taxonomy/names.dmp\" >> ./cogconf.txt\n",
    "echo \"# Name of database with representative taxids\" >> ./cogconf.txt\n",
    "echo \"databaseName:busco_refseq\" >> ./cogconf.txt\n",
    "echo \"# Path to Blastp utility\" >> ./cogconf.txt\n",
    "echo \"path2blastp:${blastp_path}\" >> ./cogconf.txt\n",
    "echo \"# Path to BlastDBCmd utility\" >> ./cogconf.txt\n",
    "echo \"blastdbcmd:${blastdbcmd_path}\" >> ./cogconf.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your database is ready for work! Please, create a conda environment for this tool:\n",
    "```\n",
    "conda env create -n [NEW_ENV_NAME] -f dependencies.yml\n",
    "```\n",
    "...activate it...\n",
    "```\n",
    "conda activate [NEW_ENV_NAME]\n",
    "```\n",
    " and test your installation (the algorithm uses 40 threads by default; enter your value in the \"-t\" parameter -- we use 10 threads in this example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "echo \"NP_001104262\" > ./refseq_proteomes/test_input.txt\n",
    "# Optional: uncomment the next line to analyze two genes one by one.\n",
    "# echo \"NP_001278\" >> ./refseq_proteomes/test_input.txt\n",
    "\n",
    "# Observe the test_input.txt for input file example\n",
    "\n",
    "python3 ./cog.py \\\n",
    "    ./refseq_proteomes/test_input.txt \\\n",
    "    ./refseq_proteomes/test_output \\\n",
    "    -t 10"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cog",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
